In this project, we aim to understand the effect of time until cyropreservation (room temperature or 4ºC) on single-cell transcriptional profiles of chronic lymphocytic leukemia (CLL) cells.
To that end, we drew blood from 3 CLL patients (ids: 1220, 1472, 1892) and cyropreserved the samples after 0h (fresh), 2h, 4h, 6h, 8h and 24h at either room temperature (RT) or 4ºC (4C). To eliminate batch effects, detect doublets and reduce the library cost, we performed cell hashing. In this protocol, each condition (in our case time-points) is labeled with a specific hashtag oligonucleotide (HTO) that is crosslinked with an antibody. The antibodies bind to ubiquitous cell surface markers, and the HTO are sequenced alongside the single-cell gene expression libraries. We have the following samples:
For each of them, we have 3 files: the expression matrix in sparse format, the list of the barcodes that identify the columns, and the list of genes that identify the rows (features). Moreover, the features file contains a column that distinguishes between genes (“Gene Expression”) and HTO (“Antibody Capture”).
The objective of this notebook is to demultiplex the barcodes (cells) back to its original time-point. To achive that, we will follow the pipeline from Seurat:
library(scater)
library(SingleCellExperiment)
library(Seurat)
library(Matrix)
library(tidyverse)
# Load expression matrix, gene and cell metadata
libraries <- c("1472_RT", "1472_4C", "1892_RT", "1892_4C", "1220_RT")
cll_list <- list()
for (lib in libraries) {
lib_path <- str_c("data/BCLLATLAS_03/", lib, "/filtered_feature_bc_matrix/")
expression_matrix <- readMM(str_c(lib_path, "matrix.mtx.gz"))
barcodes <- read_csv(str_c(lib_path, "barcodes.tsv.gz"), col_names = FALSE)
colnames(barcodes) <- "barcode"
features <- read_tsv(str_c(lib_path, "features.tsv.gz"), col_names = FALSE)
colnames(features) <- c("ensembl", "symbol", "feature_type")
rownames(expression_matrix) <- features$symbol
colnames(expression_matrix) <- barcodes$barcode
# Separate HTO and RNA matrices
hto_ind <- which(str_detect(features$feature_type, "Antibody Capture"))
rna_ind <- which(str_detect(features$feature_type, "Gene Expression"))
cll_hto <- expression_matrix[hto_ind, ]
cll_rna <- expression_matrix[rna_ind, ]
# Setup Seurat object
cll <- CreateSeuratObject(counts = cll_rna)
# Normalize RNA data with log normalization
cll <- NormalizeData(cll)
# Find and scale variable features
cll <- FindVariableFeatures(cll, selection.method = "vst")
cll <- ScaleData(cll, features = VariableFeatures(cll))
# Add HTO as an independent assay
cll[["HTO"]] <- CreateAssayObject(counts = cll_hto)
cll <- NormalizeData(cll, assay = "HTO", normalization.method = "CLR")
# Demultiplex
cll <- HTODemux(cll, assay = "HTO", positive.quantile = 0.99)
# Append to list of Seurat objects
cll_list[[lib]] <- cll
}
We can visualize the results as ridge plots or heatmaps:
ridge_l <- map(cll_list, function(cll) {
Idents(cll) <- "HTO_maxID"
RidgePlot(
cll,
assay = "HTO",
features = rownames(cll[["HTO"]])[1:6],
ncol = 2
)
})
heatmap_l <- map(cll_list, function(cll) {
HTOHeatmap(cll, assay = "HTO", ncells = 5000)
})
# # Save
# walk2(ridge_l, names(ridge_l), function(ridge, lib) {
# ggsave(
# filename = str_c("results/plots/", lib, "_hashtag_demux_ridge.pdf"),
# plot = ridge, height = 9,
# width = 16
# )
# })
# walk2(heatmap_l, names(heatmap_l), function(heat, lib) {
# ggsave(
# filename = str_c("results/plots/", lib, "_hashtag_demux_heatmap.png"),
# plot = heat,
# height = 4,
# width = 7
# )
# })
ridge_l
## $`1472_RT`
##
## $`1472_4C`
##
## $`1892_RT`
##
## $`1892_4C`
##
## $`1220_RT`
heatmap_l
## $`1472_RT`
##
## $`1472_4C`
##
## $`1892_RT`
##
## $`1892_4C`
##
## $`1220_RT`
As we can see, there is a high signal-to-noise ratio for every HTO: the samples are easily identifible and there is no cross-contamination between hashtags. Overall, the number of cells is evenly distributed across time-points, which makes it easy to detect heterotypic doublets (doublets from different time-points). However, for the donor 1892, the cells assigned to 0h involve a larger fraction of the total, so we can expect an increased proportion of undetectable 0h-0h doublets.
Finally it is useful to get an overview of the number of singlets, doublets and negative cells per library:
hto_levels <- c("Negative", "Singlet", "Doublet")
cll_gg <- map(names(cll_list), function(id) {
cll_list[[id]]@meta.data %>%
group_by(HTO_classification.global) %>%
summarise(count = n()) %>%
mutate(HTO_classification.global = factor(
HTO_classification.global,
levels = hto_levels
)) %>%
ggplot(aes(HTO_classification.global, count)) +
geom_col() +
geom_text(aes(label = count),
position = position_dodge(width = 0.9), vjust = -0.25) +
labs(title = str_to_title(id), x = "", y = "number of cells") +
theme_bw() +
theme(axis.text.x = element_text(size = 11),
plot.title = element_text(hjust = 0.5))
})
cll_gg
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
There is a low proportion of doublets and negative cells. Thus, we can conclude that we were able to successfully demultiplex cells and obtain high-quality data.
To merge the 4 Seurat objects into one we use the merge command:
cll_merged <- merge(
cll_list$`1472_RT`,
y = c(cll_list$`1472_4C`, cll_list$`1892_RT`, cll_list$`1892_4C`, cll_list$`1220_RT`),
add.cell.ids = libraries,
project = "CLL_benchmarking"
)
# Recode and retain important variables
cll_merged$donor <- str_remove(colnames(cll_merged), "_.*$")
cll_merged$time <- str_remove(cll_merged$hash.ID, "....-..-")
cll_merged$temperature <- cll_merged$hash.ID %>%
str_remove("^....-") %>%
str_remove("-.+h$")
selection <- c("nCount_RNA", "nFeature_RNA", "HTO_classification", "hash.ID",
"time", "donor", "temperature")
cll_merged@meta.data <- cll_merged@meta.data[, selection]
Finally, we can save it as .RDS for future analysis:
# saveRDS(cll_merged, "results/R_objects/cll_Seurat_demultiplexed.rds")
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.0 stringr_1.4.0 dplyr_0.8.4 purrr_0.3.3 readr_1.3.1 tidyr_1.0.2 tibble_2.1.3 tidyverse_1.3.0 Matrix_1.2-18 Seurat_3.1.4 scater_1.14.6 ggplot2_3.3.0 SingleCellExperiment_1.8.0 SummarizedExperiment_1.16.1 DelayedArray_0.12.2 BiocParallel_1.20.1 matrixStats_0.55.0 Biobase_2.46.0 GenomicRanges_1.38.0 GenomeInfoDb_1.22.0 IRanges_2.20.2 S4Vectors_0.24.3 BiocGenerics_0.32.0 BiocStyle_2.14.4
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.1.5 sn_1.5-5 plyr_1.8.6 igraph_1.2.4.2 lazyeval_0.2.2 splines_3.6.1 listenv_0.8.0 TH.data_1.0-10 digest_0.6.25 htmltools_0.4.0 magick_2.3 viridis_0.5.1 fansi_0.4.1 gdata_2.18.0 magrittr_1.5 cluster_2.1.0 ROCR_1.0-7 globals_0.12.5 modelr_0.1.6 RcppParallel_4.4.4 sandwich_2.5-1 colorspace_1.4-1 rvest_0.3.5 ggrepel_0.8.1 haven_2.2.0 xfun_0.12 crayon_1.3.4 RCurl_1.98-1.1 jsonlite_1.6.1 survival_3.1-8 zoo_1.8-7 ape_5.3 glue_1.3.1 gtable_0.3.0 zlibbioc_1.32.0 XVector_0.26.0 leiden_0.3.3 BiocSingular_1.2.2 future.apply_1.4.0 scales_1.1.0 mvtnorm_1.1-0 DBI_1.1.0 bibtex_0.4.2.2 Rcpp_1.0.3 metap_1.3 plotrix_3.7-7
## [48] viridisLite_0.3.0 reticulate_1.14 rsvd_1.0.3 tsne_0.1-3 htmlwidgets_1.5.1 httr_1.4.1 gplots_3.0.3 RColorBrewer_1.1-2 TFisher_0.2.0 ica_1.0-2 farver_2.0.3 pkgconfig_2.0.3 uwot_0.1.5 dbplyr_1.4.2 labeling_0.3 tidyselect_1.0.0 rlang_0.4.5 reshape2_1.4.3 cellranger_1.1.0 munsell_0.5.0 tools_3.6.1 cli_2.0.2 generics_0.0.2 broom_0.5.5 ggridges_0.5.2 evaluate_0.14 yaml_2.2.1 npsurv_0.4-0 fs_1.3.2 knitr_1.28 fitdistrplus_1.0-14 caTools_1.18.0 RANN_2.6.1 pbapply_1.4-2 future_1.16.0 nlme_3.1-145 xml2_1.2.2 rstudioapi_0.11 compiler_3.6.1 beeswarm_0.2.3 plotly_4.9.2 png_0.1-7 lsei_1.2-0 reprex_0.3.0 stringi_1.4.6 lattice_0.20-40 multtest_2.42.0
## [95] vctrs_0.2.3 mutoss_0.1-12 pillar_1.4.3 lifecycle_0.1.0 BiocManager_1.30.10 Rdpack_0.11-1 lmtest_0.9-37 RcppAnnoy_0.0.15 BiocNeighbors_1.4.2 data.table_1.12.8 cowplot_1.0.0 bitops_1.0-6 irlba_2.3.3 gbRd_0.4-11 patchwork_1.0.0 R6_2.4.1 bookdown_0.18 KernSmooth_2.23-16 gridExtra_2.3 vipor_0.4.5 codetools_0.2-16 MASS_7.3-51.5 gtools_3.8.1 assertthat_0.2.1 withr_2.1.2 sctransform_0.2.1 mnormt_1.5-6 multcomp_1.4-12 GenomeInfoDbData_1.2.2 hms_0.5.3 grid_3.6.1 rmarkdown_2.1 DelayedMatrixStats_1.8.0 Rtsne_0.15 lubridate_1.7.4 numDeriv_2016.8-1.1 ggbeeswarm_0.6.0